PIRSF Family Classification System for Protein Functional and Evolutionary Analysis

Author: Arighi Cecilia N.
Barker Winona C.
Huang Hongzhan
Nikolskaya Anastasia N.
Wu Cathy H.
Publication venue: Libertas Academica
Publication date: 01/01/2006

The PIRSF protein classification system (http://pir.georgetown.edu/pirsf/) reflects evolutionary relationships of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). PIRSF families are curated systematically based on literature review and integrative sequence and functional analysis, including sequence and structure similarity, domain architecture, functional association, genome context, and phyletic pattern. The results of classification and expert annotation are summarized in PIRSF family reports with graphical viewers for taxonomic distribution, domain architecture, family hierarchy, and multiple alignment and phylogenetic tree. The PIRSF system provides a comprehensive resource for bioinformatics analysis and comparative studies of protein function and evolution. Domain or fold-based searches allow identification of evolutionarily related protein families sharing domains or structural folds. Functional convergence and functional divergence are revealed by the relationships between protein classification and curated family functions. The taxonomic distribution allows the identification of lineage-specific or broadly conserved protein families and can reveal horizontal gene transfer. Here we demonstrate, with illustrative examples, how to use the web-based PIRSF system as a tool for functional and evolutionary studies of protein families

Directory of Open Access Journals

PubMed Central

An overview of the BioCreative 2012 Workshop Track III: interactive text mining task

Author: Arighi Cecilia N.
Chan Juancarlos
Li Yuling
Muller Hans-Michael
Van Auken Kimberly
Publication venue: Oxford University Press (OUP)
Publication date: 17/01/2013

In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. 
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and systems documentation, to adapt the tools to the needs and query types the end user might have and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics have been analyzed during this task. This analysis will help to plan for a more intense study in BioCreative IV

Caltech Authors

BioRED: A Comprehensive Biomedical Relation Extraction Dataset

Author: Arighi Cecilia N
Lai Po-Ting
Lu Zhiyong
Luo Ling
Wei Chih-Hsuan
Publication date: 08/04/2022

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g., protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then we present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical), on a set of 600 PubMed articles. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a comprehensive dataset can successfully facilitate the development of more accurate, efficient, and robust RE systems for biomedicine.
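As a rough illustration of how an F-score like those quoted in this abstract can be computed, here is a minimal Python sketch. It assumes a balanced F-score (F1) over exact-match tuples; the abstract does not state the evaluation protocol, so this is not the authors' scoring method. The function name `f_score` and all entity identifiers are hypothetical.

```python
from typing import Hashable, Iterable


def f_score(gold: Iterable[Hashable], predicted: Iterable[Hashable]) -> float:
    """Balanced F-score (F1) over exact-match items (an assumed protocol)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy relations: (entity 1, entity 2, relation type, novelty label).
gold = [
    ("GENE_A", "DISEASE_X", "association", "novel"),
    ("CHEM_B", "CHEM_C", "association", "background"),
    ("GENE_D", "DISEASE_Y", "association", "novel"),
]
predicted = [
    ("GENE_A", "DISEASE_X", "association", "novel"),
    ("CHEM_B", "CHEM_C", "association", "novel"),  # right pair, wrong novelty label
]

# Score once with the novelty label included, once with it stripped off.
score_with_novelty = f_score(gold, predicted)
score_pairs_only = f_score([g[:3] for g in gold], [p[:3] for p in predicted])
```

In this toy case the score drops once the novelty label must also match, which is the kind of gap the abstract reports between the NER task and novel-relation extraction.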

arXiv.org e-Print Archive

An improved ontological representation of dendritic cells as a paradigm for all cell types

Author: Arighi Cecilia N.
Mungall Chris
Cowell Lindsay G.
Diehl Alexander D.
Lieberman Anne E.
Masci Anna Maria
Scheuermann Richard H.
Smith Barry
Publication date: 01/01/2009

The Cell Ontology (CL) is designed to provide a standardized representation of cell types for data annotation. Currently, the CL employs multiple is_a relations, defining cell types in terms of histological, functional, and lineage properties, and the majority of definitions are written with sufficient generality to hold across multiple species. This approach limits the CL’s utility for cross-species data integration. To address this problem, we developed a method for the ontological representation of cells and applied this method to develop a dendritic cell ontology (DC-CL). DC-CL subtypes are delineated on the basis of surface protein expression, systematically including both species-general and species-specific types and optimizing DC-CL for the analysis of flow cytometry data. This approach brings benefits in the form of increased accuracy, support for reasoning, and interoperability with other ontology resources.

104. Barry Smith, “Toward a Realistic Science of Environments”, Ecological Psychology, 2009, 21 (2), April-June, 121-130. Abstract: The perceptual psychologist J. J. Gibson embraces a radically externalistic view of mind and action. We have, for Gibson, not a Cartesian mind or soul, with its interior theater of contents and the consequent problem of explaining how this mind or soul and its psychological environment can succeed in grasping physical objects external to itself. Rather, we have a perceiving, acting organism, whose perceptions and actions are always already tuned to the parts and moments, the things and surfaces, of its external environment. We describe how on this basis Gibson sought to develop a realist science of environments which will be ‘consistent with physics, mechanics, optics, acoustics, and chemistry’.

PhilPapers

Role of the mammalian retromer in sorting of the cation-independent mannose 6-phosphate receptor

Author: Aguilar Ruben C.
Arighi Cecilia N.
Bonifacino Juan S.
Haft Carol R.
Hartnell Lisa M.
Publication venue: The Rockefeller University Press

The cation-independent mannose 6-phosphate receptor (CI-MPR) mediates sorting of lysosomal hydrolase precursors from the TGN to endosomes. After releasing the hydrolase precursors into the endosomal lumen, the unoccupied receptor returns to the TGN for further rounds of sorting. Here, we show that the mammalian retromer complex participates in this retrieval pathway. The hVps35 subunit of retromer interacts with the cytosolic domain of the CI-MPR. This interaction probably occurs in an endosomal compartment, where most of the retromer is localized. In particular, retromer is associated with tubular–vesicular profiles that emanate from early endosomes or from intermediates in the maturation from early to late endosomes. Depletion of retromer by RNA interference increases the lysosomal turnover of the CI-MPR, decreases cellular levels of lysosomal hydrolases, and causes swelling of lysosomes. These observations indicate that retromer prevents the delivery of the CI-MPR to lysosomes, probably by sequestration into endosome-derived tubules from where the receptor returns to the TGN

Crossref

PubMed Central

Protein Ontology: A controlled structured network of protein entities

Author: Arighi Cecilia N.
Blake Judith A.
Bult Carol J.
Christie Karen R.
Diehl Alexander D.
Drabkin Harold J.
Cowart Julie
Natale Darren A.
Helfer Olivia
Others
D’Eustachio Peter
Smith Barry
Publication date: 01/01/2013

The Protein Ontology (PRO; http://proconsortium.org) formally defines protein entities and explicitly represents their major forms and interrelations. Protein entities represented in PRO corresponding to single amino acid chains are categorized by level of specificity into family, gene, sequence and modification metaclasses, and there is a separate metaclass for protein complexes. All metaclasses also have organism-specific derivatives. PRO complements established sequence databases such as UniProtKB, and interoperates with other biomedical and biological ontologies such as the Gene Ontology (GO). PRO relates to UniProtKB in that PRO’s organism-specific classes of proteins encoded by a specific gene correspond to entities documented in UniProtKB entries. PRO relates to the GO in that PRO’s representations of organism-specific protein complexes are subclasses of the organism-agnostic protein complex terms in the GO Cellular Component Ontology. The past few years have seen growth and changes to the PRO, as well as new points of access to the data and new applications of PRO in immunology and proteomics. Here we describe some of these developments

PhilPapers

Overview of the BioCreative III Workshop

Author: Arighi Cecilia N
Cohen Kevin B
Hirschman Lynette
Krallinger Martin
Lu Zhiyong
Valencia Alfonso
Wilbur W John
Wu Cathy H
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools which are useful to the communities of researchers and database curators in the biological sciences. To this end BioCreative I was held in 2004, BioCreative II in 2007, and BioCreative II.5 in 2009. Each of these workshops involved humanly annotated test data for several basic tasks in text mining applied to the biomedical literature. Participants in the workshops were invited to compete in the tasks by constructing software systems to perform the tasks automatically and were given scores based on their performance. The results of these workshops have benefited the community in several ways. They have 1) provided evidence for the most effective methods currently available to solve specific problems; 2) revealed the current state of the art for performance on those problems; and 3) provided gold standard data and results on that data by which future advances can be gauged. This special issue contains overview papers for the three tasks of BioCreative III.
Results: The BioCreative III Workshop was held in September of 2010 and continued the tradition of a challenge evaluation on several tasks judged basic to effective text mining in biology, including a gene normalization (GN) task and two protein-protein interaction (PPI) tasks. In total the Workshop involved the work of twenty-three teams. Thirteen teams participated in the GN task which required the assignment of EntrezGene IDs to all named genes in full text papers without any species information being provided to a system. Ten teams participated in the PPI article classification task (ACT) requiring a system to classify and rank a PubMed® record as belonging to an article either having or not having “PPI relevant” information. Eight teams participated in the PPI interaction method task (IMT) where systems were given full text documents and were required to extract the experimental methods used to establish PPIs and a text segment supporting each such method. Gold standard data was compiled for each of these tasks and participants competed in developing systems to perform the tasks automatically. BioCreative III also introduced a new interactive task (IAT), run as a demonstration task. The goal was to develop an interactive system to facilitate a user’s annotation of the unique database identifiers for all the genes appearing in an article. This task included ranking genes by importance (based preferably on the amount of described experimental information regarding genes). There was also an optional task to assist the user in finding the most relevant articles about a given gene. For BioCreative III, a user advisory group (UAG) was assembled and played an important role 1) in producing some of the gold standard annotations for the GN task, 2) in critiquing IAT systems, and 3) in providing guidance for a future more rigorous evaluation of IAT systems. Six teams participated in the IAT demonstration task and received feedback on their systems from the UAG group. Besides innovations in the GN and PPI tasks making them more realistic and practical and the introduction of the IAT task, discussions were begun on community data standards to promote interoperability and on user requirements and evaluation metrics to address utility and usability of systems.
Conclusions: In this paper we give a brief history of the BioCreative Workshops and how they relate to other text mining competitions in biology. This is followed by a synopsis of the three tasks GN, PPI, and IAT in BioCreative III with figures for best participant performance on the GN and PPI tasks. These results are discussed and compared with results from previous BioCreative Workshops and we conclude that the best performing systems for GN, PPI-ACT and PPI-IMT in realistic settings are not sufficient for fully automatic use. This provides evidence for the importance of interactive systems and we present our vision of how best to construct an interactive system for a GN- or PPI-like task in the remainder of the paper.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

TGF-beta signaling proteins and the Protein Ontology

Author: Arighi Cecilia N
Barker Winona C
Blake Judith A
Drabkin Harold
Liu Hongfang
Natale Darren A
Smith Barry
Wu Cathy H
Publication venue: BioMed Central
Publication date: 01/01/2009

BACKGROUND: The Protein Ontology (PRO) is designed as a formal and principled Open Biomedical Ontologies (OBO) Foundry ontology for proteins. The components of PRO extend from a classification of proteins on the basis of evolutionary relationships at the homeomorphic level to the representation of the multiple protein forms of a gene, including those resulting from alternative splicing, cleavage and/or post-translational modifications. Focusing specifically on the TGF-beta signaling proteins, we describe the building, curation, usage and dissemination of PRO. RESULTS: PRO is manually curated on the basis of PrePRO, an automatically generated file with content derived from standard protein data sources. Manual curation ensures that the treatment of the protein classes and the internal and external relationships conform to the PRO framework. The current release of PRO is based upon experimental data from mouse and human proteins wherein equivalent protein forms are represented by single terms. In addition to the PRO ontology, the annotation of PRO terms is released as a separate PRO association file, which contains, for each given PRO term, an annotation from the experimentally characterized sub-types as well as the corresponding database identifiers and sequence coordinates. The annotations are added in the form of relationship to other ontologies. Whenever possible, equivalent forms in other species are listed to facilitate cross-species comparison. Splice and allelic variants, gene fusion products and modified protein forms are all represented as entities in the ontology. Therefore, PRO provides for the representation of protein entities and a resource for describing the associated data. 
This makes PRO useful both for proteomics studies where isoforms and modified forms must be differentiated, and for studies of biological pathways, where representations need to take account of the different ways in which the cascade of events may depend on specific protein modifications. CONCLUSION: PRO provides a framework for the formal representation of protein classes and protein forms in the OBO Foundry. It is designed to enable data retrieval and integration and machine reasoning at the molecular level of proteins, thereby facilitating cross-species comparisons, pathway analysis, disease modeling and the generation of new hypotheses

Crossref

The Jackson Laboratory: The Mouseion at the JAXlibrary

Springer - Publisher Connector

PubMed Central

Overview of the COVID-19 text mining tool interactive demonstration track in BioCreative VII

Author: Arighi Cecilia N
Chatr-aryamontri Andrew
Dolinski Kara
Hirschman Lynette
Korves Tonia
Krallinger Martin
Oughtred Rose
Ross Karen E
Tyers Mike
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2022

The coronavirus disease 2019 (COVID-19) pandemic has compelled biomedical researchers to communicate data in real time to establish more effective medical treatments and public health policies. Nontraditional sources such as preprint publications, i.e. articles not yet validated by peer review, have become crucial hubs for the dissemination of scientific results. Natural language processing (NLP) systems have been recently developed to extract and organize COVID-19 data in reasoning systems. Given this scenario, the BioCreative COVID-19 text mining tool interactive demonstration track was created to assess the landscape of the available tools and to gauge user interest, thereby providing a two-way communication channel between NLP system developers and potential end users. The goal was to inform system designers about the performance and usability of their products and to suggest new additional features. Considering the exploratory nature of this track, the call for participation solicited teams to apply for the track, based on their system’s ability to perform COVID-19-related tasks and interest in receiving user feedback. We also recruited volunteer users to test systems. Seven teams registered systems for the track, and >30 individuals volunteered as test users; these volunteer users covered a broad range of specialties, including bench scientists, bioinformaticians and biocurators. The users, who had the option to participate anonymously, were provided with written and video documentation to familiarize themselves with the NLP tools and completed a survey to record their evaluation. Additional feedback was also provided by NLP system developers. The track was well received as shown by the overall positive feedback from the participating teams and the users.

National Institutes of Health Office of Research Infrastructure Programs (R01OD010929 to M.T. and K.D.); Canadian Institutes of Health Research (FDN-167277 to M.T.); Canada Research Chair in Systems and Synthetic Biology (to M.T.); National Institutes of Health (2U24HG007822-08, 1R35 GM141873-01 to K.E.R. and C.N.A.); Spanish Plan for the Advancement of Language Technology and Proyectos I+D+i2020-AI4PROFHEALTH (PID2020-119266RA-I00 to M.K.); MITRE (W56KGU-18-D-0004 to L.H. and T.K.). The views, opinions and/or findings contained in this report are those of the authors and should not be construed as an official government position, policy or decision.

UPCommons. Portal del coneixement obert de la UPC

PubMed Central

An improved ontological representation of dendritic cells as a paradigm for all cell types

Author: Arighi Cecilia N
Cowell Lindsay G
Diehl Alexander D
Lieberman Anne E
Masci Anna Maria
Mungall Chris
Scheuermann Richard H
Smith Barry
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Recent increases in the volume and diversity of life science data and information and an increasing emphasis on data sharing and interoperability have resulted in the creation of a large number of biological ontologies, including the Cell Ontology (CL), designed to provide a standardized representation of cell types for data annotation. Ontologies have been shown to have significant benefits for computational analyses of large data sets and for automated reasoning applications, leading to organized attempts to improve the structure and formal rigor of ontologies to better support computation. Currently, the CL employs multiple is_a relations, defining cell types in terms of histological, functional, and lineage properties, and the majority of definitions are written with sufficient generality to hold across multiple species. This approach limits the CL's utility for computation and for cross-species data integration.
Results: To enhance the CL's utility for computational analyses, we developed a method for the ontological representation of cells and applied this method to develop a dendritic cell ontology (DC-CL). DC-CL subtypes are delineated on the basis of surface protein expression, systematically including both species-general and species-specific types and optimizing DC-CL for the analysis of flow cytometry data. We avoid multiple uses of is_a by linking DC-CL terms to terms in other ontologies via additional, formally defined relations such as has_function.
Conclusion: This approach brings benefits in the form of increased accuracy, support for reasoning, and interoperability with other ontology resources. Accordingly, we propose our method as a general strategy for the ontological representation of cells. DC-CL is available from http://www.obofoundry.org.

PhilPapers

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central